National Repository of Grey Literature: 4 records found
Knowledge mining from text data related to the migrant crisis (Dolování znalostí z textových dat související s migrační krizí)
Koukal, Filip
This thesis focuses on the use of machine learning techniques for knowledge mining from text data associated with the migrant crisis. The data consist of articles and their comments downloaded from idnes.cz, an online news portal. The thesis explores the capabilities of Word2Vec for knowledge mining. A number of experiments focusing on the identification and characterization of topics embedded in the downloaded articles were defined and carried out.
Named entity recognition in text (Rozpoznání pojmenovaných entit v textu)
Süss, Martin
This thesis deals with named entity recognition (NER) in text, implemented with machine learning techniques. Recently, techniques for creating word embedding models have been introduced. These word vectors can encode many useful relationships between words in text data, such as their syntactic or semantic similarity. Modern NER systems use these vector features to improve their quality. However, only a few of them investigate in greater detail how much impact these vectors have on recognition and whether they can be optimized for even greater recognition quality. This thesis examines various factors that may affect the quality of word embeddings, and thus the resulting quality of the NER system. A series of experiments has been performed to examine these factors, such as corpus quality and size, vector dimensions, text preprocessing techniques, and various algorithms (Word2Vec, GloVe and FastText) and their parameters. The results bring useful findings that can be applied when creating word vectors and thus indirectly increase the resulting quality of NER systems.
Genres classification by means of machine learning
Bílek, Jan ; Neruda, Roman (advisor) ; Vomlelová, Marta (referee)
In this thesis, we compare the bag of words approach with doc2vec document embeddings on the task of classifying book genres. We create 3 datasets with different text lengths by extracting short snippets from books in the Project Gutenberg repository. Each dataset comprises more than 200,000 documents and 14 different genres. For 3200-character documents, we achieve an F1-score of 0.862 when stacking models trained on both bag of words and doc2vec representations. We also explore the relationships between documents, genres and words using similarity metrics on their vector representations and report typical words for each genre. As part of the thesis, we also present an online webapp for book genre classification.
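As a rough illustration of the bag of words representation and of a similarity metric on such vectors, the following minimal sketch counts whitespace-separated tokens and compares snippets with cosine similarity. The abstract does not say which tokenization or similarity metric the thesis used, so both are assumptions here; the helper names and the example snippets are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Represent a text as token counts (assumed: lowercase, whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical snippets; not taken from the thesis datasets.
snippet_a = bag_of_words("the detective examined the clue")
snippet_b = bag_of_words("the detective found a clue")
snippet_c = bag_of_words("dragons guard ancient gold")

sim_ab = cosine_similarity(snippet_a, snippet_b)
sim_ac = cosine_similarity(snippet_a, snippet_c)
```

Snippets that share vocabulary score higher than snippets with no words in common, which is the basic signal a bag of words classifier relies on.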
Analysis of stock market sentiment with social media
Čermák, Vojtěch ; Baruník, Jozef (advisor) ; Vacek, Pavel (referee)
In the thesis, we explored the prospects of extracting sentiment contained in Twitter messages. We proposed a novel approach consisting of directly predicting volatility on the stock market from features obtained from the text documents using a suitable document representation. We compared the performance of standard document vectorisation methods as well as a novel approach based on aggregating word vectors created by word embeddings. We showed that direct modelling of a market variable is possible with most of the proposed vectorisation techniques. In particular, the strong predictive power of aggregated word embeddings suggests that they are an excellent sentiment representation, because they are independent of message volume and capture the semantic information in the tweets well. Moreover, our findings suggest that vectorisation by aggregating word embeddings is a viable approach even for large documents.
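The idea of aggregating word vectors into a single document vector can be sketched as follows. This sketch assumes mean pooling, which is one common aggregation; the abstract does not state which aggregation the thesis used. The toy 3-dimensional vectors and the `aggregate` helper are hypothetical and stand in for embeddings from a trained model.

```python
from typing import Dict, List

# Toy word vectors; hypothetical values for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "stocks": [1.0, 0.0, 0.0],
    "fall": [0.0, 1.0, 0.0],
    "rally": [0.0, 0.0, 1.0],
}


def aggregate(tokens: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        # No known tokens: fall back to a zero vector of the right size.
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# A short text and a 50-times longer text with the same word mix.
short_doc = aggregate(["stocks", "fall"], TOY_VECTORS)
long_doc = aggregate(["stocks", "fall"] * 50, TOY_VECTORS)
```

Because the mean divides by the number of tokens, the short and the long text map to the same fixed-length vector, which is one way to read the claim that the representation is independent of message volume.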
